The Great Leap: From Math to Experience
In the realm of Dynamic Programming, we were like gods with a map; we knew the exact probability $p(s', r | s, a)$ of every possible wind gust and terrain shift. But the real world rarely provides such a map. Monte Carlo (MC) methods represent a fundamental shift in philosophy: we stop calculating expectations over models and start learning from sampled experience.
The Mechanics of Sampling
MC methods still follow the interplay of policy evaluation and policy improvement within the Generalized Policy Iteration (GPI) framework. Instead of bootstrapping from the value estimates of neighboring states, we run an episode all the way to termination and compute the actual return $G_t$.
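The backward pass for computing returns can be sketched as follows. This is a minimal illustration, assuming rewards are stored as a list and indexed so that `rewards[t]` is the reward received after leaving step `t`; the function name and example values are hypothetical:

```python
def compute_returns(rewards, gamma=0.9):
    """Work backward from the terminal step: G_t = r_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()  # restore chronological order
    return returns

print(compute_returns([0, 0, 1], gamma=0.5))  # → [0.25, 0.5, 1.0]
```

Working backward makes each $G_t$ a single addition and multiplication, rather than re-summing the discounted tail at every step.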
- First-visit MC: We average only the return following the first time a state is encountered in each episode.
- Every-visit MC: We average the returns following every encounter with the state. Both converge to $V_\pi$ as the number of visits grows without bound.
- No Bootstrapping: Because each estimate is built from complete, independent returns rather than from other estimates, MC is less sensitive to violations of the Markov property, where the current state representation hides relevant history.
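Putting the pieces together, here is a minimal sketch of first-visit MC prediction. The toy two-state chain (`A -> B -> terminal`), its reward values, and all function names are illustrative assumptions, not from the text:

```python
import random
from collections import defaultdict

def first_visit_mc(sample_episode, num_episodes=5000, gamma=1.0):
    """First-visit MC prediction: estimate V by averaging the return
    that follows the first occurrence of each state per episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        episode = sample_episode()  # list of (state, reward) pairs
        # Compute G_t for every step by working backward from the end.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        # Credit only the first visit to each state in this episode.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns[t]
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

def sample_episode():
    # Hypothetical chain: A -> B -> terminal. Leaving A yields reward 0;
    # leaving B yields 0 or 2 with equal probability (mean 1).
    return [("A", 0), ("B", random.choice([0, 2]))]

V = first_visit_mc(sample_episode, num_episodes=20000, gamma=0.9)
# With enough episodes, V["B"] approaches 1.0 and V["A"] approaches 0.9.
```

Switching the `seen` check off (crediting every occurrence) would turn this into every-visit MC; for this toy chain, where no state repeats within an episode, the two variants coincide.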